fix(dflash): auto-detect GPU arch to prevent sm_120a on consumer Blackwell by easel · Pull Request #48 · Luce-Org/lucebox-hub

easel · 2026-04-27T19:52:55Z

Problem

On CUDA 13.2+ with consumer Blackwell GPUs (for example, the RTX 5090, SM 12.0), leaving
CMAKE_CUDA_ARCHITECTURES unset or setting it to native can resolve to sm_120a
instead of sm_120. The resulting binary can trigger
CUDA_ERROR_ILLEGAL_INSTRUCTION at runtime on consumer hardware.

Fix (auto-detect only)

At configure time, if CMAKE_CUDA_ARCHITECTURES is unset or set to native, run
nvidia-smi --query-gpu=compute_cap --format=csv,noheader.
Parse the reported compute capability (for example, 12.0) and set
CMAKE_CUDA_ARCHITECTURES explicitly (for example, 120).
Keep the change isolated to dflash/CMakeLists.txt so it can be reviewed and
merged independently from consumer-specific workaround behavior.
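
For reviewers who want to sanity-check the mapping outside of CMake, here is a rough Python sketch of the same detect-and-parse step. It is not the code in dflash/CMakeLists.txt, which does this in CMake. The function names are hypothetical, and the multi-GPU handling (de-duplicate and join with `;`) is an assumption, since the description above only covers the single-GPU case.

```python
import subprocess
from typing import List, Optional


def compute_cap_to_arch(compute_cap: str) -> str:
    """Convert a compute capability such as "12.0" to an arch value such as "120"."""
    major, _, minor = compute_cap.strip().partition(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"unexpected compute capability: {compute_cap!r}")
    return f"{int(major)}{int(minor)}"


def parse_nvidia_smi(output: str) -> List[str]:
    """Parse `--format=csv,noheader` output (one GPU per line) into unique arch values."""
    archs: List[str] = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        arch = compute_cap_to_arch(line)
        if arch not in archs:  # assumption: identical GPUs collapse to one entry
            archs.append(arch)
    return archs


def detect_cuda_architectures() -> Optional[str]:
    """Return e.g. "120", or None when nvidia-smi is missing, fails, or prints something unexpected."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
        archs = parse_nvidia_smi(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        return None
    return ";".join(archs) or None
```

Returning None on any failure leaves the caller free to fall back to whatever the build did before, so a machine without nvidia-smi (CI, cross-compilation) is not broken by the detection step.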

Test plan

cmake -B build -S dflash/ prints dflash27b: GPU compute_cap 12.0 → CUDA_ARCHITECTURES=120 on Blackwell hardware.
cmake --build build succeeds without CUDA-arch-related compiler or runtime errors on a consumer Blackwell system.

davide221 · 2026-04-28T08:31:13Z

@easel thanks for the contribution! Is the speed problem still present?

easel · 2026-04-28T22:09:48Z

> @easel thanks for the contribution! Is the speed problem still present?

Yes. I think it's related to the workflow -- I'm putting together a small benchmark script to compare.

easel · 2026-05-04T23:23:51Z

This may not be necessary if the expectation is to always build a multi-arch binary. I ran into it because Claude got excited about optimizing and ended up with a slightly incompatible build.

Two CMake-side rough edges that bit me on Windows MSVC + CUDA 12.x on RTX 6000 Ada (sm_89, Ada-only):

1. CUDA architectures: when no explicit override is provided, the previous CMakeLists could fall back to `75;86`, which caused silent build issues on Ada-only setups. This change respects DFLASH27B_USER_CUDA_ARCHITECTURES (e.g. `89`) and uses it consistently across the dflash and submodule ggml/llama.cpp consumers.
2. BSA was sometimes silently disabled depending on detection order. DFLASH27B_ENABLE_BSA is now respected as an explicit opt-in/opt-out and a clear status line is printed at configure time.

Net effect: a single-arch Ada-only build with BSA enabled is reproducible from a clean checkout. Default behaviour (no DFLASH27B_USER_CUDA_ARCHITECTURES set, BSA on) is preserved for existing users.

Validation:

    cmake -S dflash -B dflash/build/Release \
      -DCMAKE_BUILD_TYPE=Release \
      -DDFLASH27B_USER_CUDA_ARCHITECTURES=89 \
      -DDFLASH27B_ENABLE_BSA=ON
    cmake --build dflash/build/Release --target test_dflash --parallel 8

-> BUILD_EXIT_CODE=0, sm_89 single-arch confirmed.

Verification vs existing community PRs: COMP-COMPL with Luce-Org#48 ("auto-detect GPU arch to prevent sm_120a on consumer Blackwell", open) and Luce-Org#91 ("expose BSA config as CLI flags with safety warnings", merged 2026-05-04). Luce-Org#48 covers auto-detect; Luce-Org#91 covers runtime CLI. This PR covers the build-time CMake side: respect the user's explicit DFLASH27B_USER_CUDA_ARCHITECTURES override and keep DFLASH27B_ENABLE_BSA honest. The three PRs together give sensible defaults per hardware tier.

Author: Javier Pazo <xabicasa@gmail.com>
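
One way to read how an explicit override and auto-detection could fit together is a simple precedence chain. The Python sketch below is only an illustration: the function name is hypothetical, and the order (explicit user override first, then a detected value, then a fallback list) is an assumption and is not confirmed by either PR's CMake code.

```python
from typing import Optional


def resolve_cuda_architectures(
    user_override: Optional[str],
    detected: Optional[str],
    fallback: str,
) -> str:
    """Pick the first non-empty value: user override, then detected, then fallback."""
    for candidate in (user_override, detected):
        if candidate and candidate.strip():
            return candidate.strip()
    return fallback


# An explicit override (e.g. DFLASH27B_USER_CUDA_ARCHITECTURES=89) wins over detection.
print(resolve_cuda_architectures("89", "120", "75;86"))
# With no override, a detected value is used.
print(resolve_cuda_architectures(None, "120", "75;86"))
# With neither, the fallback list applies.
print(resolve_cuda_architectures("", None, "75;86"))
```

Treating empty strings the same as unset values mirrors how an empty cache variable is usually handled at configure time, so passing an empty override does not silently select an empty architecture list.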

easel mentioned this pull request Apr 27, 2026

fix(ggml-cuda): skip sm_120→sm_120a for consumer Blackwell (no FP4 MMA) Luce-Org/llama.cpp-dflash-ggml#3

Merged

3 tasks

easel force-pushed the fix/consumer-blackwell-auto-detect branch from 2f759e7 to 653df28 Compare April 27, 2026 20:07

This was referenced Apr 27, 2026

feat(server): derive model name from GGUF; default port 1236, ctx 128K #47

Closed

refactor(server): proper Python package with tool-calling, tests, uv setup #43

Closed

easel force-pushed the fix/consumer-blackwell-auto-detect branch from ffdc796 to e6dc0cd Compare April 29, 2026 00:30

easel mentioned this pull request May 4, 2026

fix(dflash): auto-detect CUDA architecture for Blackwell #98

Closed

fix(dflash): auto-detect CUDA architecture for Blackwell

858b84b

easel force-pushed the fix/consumer-blackwell-auto-detect branch from e6dc0cd to 858b84b Compare May 4, 2026 20:24

javierpazo mentioned this pull request May 9, 2026

chore(dflash): enforce sm_89 user override and keep BSA enabled #137

Open

easel wants to merge 1 commit into Luce-Org:main from easel:fix/consumer-blackwell-auto-detect
